Ahmed T. Hammad

Data Science Project Boilerplate

A data science boilerplate is a standardized, reusable set of code, templates, libraries, and best practices, pre-defined and organized to kickstart a data science project. It serves as a foundation for data scientists and analysts, saving time and effort when beginning a new project or analysis. Here’s the boilerplate I recommend for basically any project.

The folder structure

.
├── data
│   └── data.csv
├── Dockerfile
├── Makefile
├── .env
├── .envrc
├── .gitignore
├── notebooks
│   └── datascientist_deliverable.ipynb
├── README.md
├── requirements.txt
├── scripts
│   └── script.py
├── setup.py
└── thepkg
    ├── __init__.py
    ├── interface
    │   └── __init__.py
    ├── ml_logic
    │   ├── data.py
    │   ├── __init__.py
    │   ├── model.py
    │   └── preprocessor.py
    ├── params.py
    └── utils.py

This folder structure represents a typical directory layout for a data science project. Each folder and file serves a specific purpose in organizing and managing the project’s code, data, documentation, and other resources. Here is an explanation of each item in the structure:

  • data: This folder contains the project’s data files. In this case, there is a single CSV file named data.csv, but you can add more data files as needed.

  • Dockerfile: This file defines the instructions for building a Docker container for your project. Docker allows you to encapsulate your project environment and dependencies for consistency and portability. This is a key element in most of my projects, since I love to develop inside a Docker container (see my other blog post on how to use Docker with ‘bind mounts’). A sketch of the build-and-run commands appears after this list.

  • Makefile: A Makefile contains a set of rules and commands for building, testing, and running various aspects of your project. It can automate common development tasks.

  • .env: This file is often used to store environment variables specific to your project. These variables can include API keys, database connection strings, or other sensitive information.

  • .envrc: This file is typically used in conjunction with a tool like direnv to manage environment variables for your project, ensuring that the correct environment is set up when you enter the project directory (see the sketch after this list).

  • .gitignore: This file specifies files and folders that should be ignored by Git when tracking changes. It helps avoid including sensitive or unnecessary files in version control.

  • notebooks: This folder is meant for Jupyter notebooks used for data exploration, analysis, and documentation. In this case, there’s a single notebook file named datascientist_deliverable.ipynb.

  • README.md: This Markdown file is used to provide an overview and documentation of the project. It typically includes project goals, setup instructions, usage examples, and other relevant information.

  • requirements.txt: This file lists the Python packages and their versions required for the project. You can use it to recreate the project’s environment on another system (see the workflow sketch after this list).

  • scripts: This folder is intended for Python scripts that are part of your project. In this example, there’s a single script file named script.py.

  • setup.py: This is a Python script used for packaging and distributing your project as a Python package; it is directly related to the folder `thepkg`. It’s often used when you want to share your code with others or publish it on platforms like PyPI.

  • thepkg: This folder is the main Python package of your project. During development, you would install the package in editable mode with the classic `pip install -e .`, so that updated functions are picked up without reinstalling the package over and over (see the workflow sketch after this list). The folder is organized in a way that follows Python package conventions:

    • __init__.py: These files indicate that the directories are Python packages and can be imported as modules.

    • interface: This subpackage holds the interface or API of your project.

    • ml_logic: This subpackage contains the machine learning logic, including data handling (data.py), preprocessing (preprocessor.py), and modeling (model.py).

    • params.py: This file holds project-specific configuration parameters and settings.

    • utils.py: This file contains utility functions and helper code used throughout the project.
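
For the Dockerfile, the exact contents depend on the project, but the development loop I have in mind looks roughly like this. This is just a sketch: the image name my_project and the /app working directory are example choices, and the last command assumes the image ships with bash.

# Build an image from the Dockerfile in the project root
docker build -t my_project .

# Start a container with the project bind-mounted, so that edits
# on the host are visible inside the container immediately
docker run -it --rm -v "$(pwd)":/app -w /app my_project bash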
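
To make the .env / .envrc pair concrete, here is a minimal sketch of the two files as used with direnv; the variable names and values are placeholders, not part of the boilerplate itself. Remember to list .env in .gitignore so secrets never reach version control.

# .env: never committed; placeholder values only
API_KEY=replace-me
DATABASE_URL=postgresql://localhost:5432/mydb

# .envrc: read by direnv when you enter the project directory
dotenv          # load the variables defined in .env
# run `direnv allow` once to authorize this file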
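
And here is the typical dependency and installation workflow, run from the project root inside your virtual environment or container; it ties requirements.txt and setup.py together:

# Install the pinned dependencies
pip install -r requirements.txt

# Install thepkg in editable mode: code changes are picked up
# without reinstalling
pip install -e .

# Optionally pin the current environment back into requirements.txt
pip freeze > requirements.txt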

Overall, this folder structure provides a well-organized framework for a data science project, making it easier to collaborate, manage dependencies, and maintain consistency in your work.

Remember that a boilerplate is just a template. Feel free to use this structure as a first step into your new data science project.

Finally, here is a little bash script to create the entire folder structure.

#!/bin/bash
set -e  # stop at the first error

# Create the main project directory and enter it
mkdir -p my_data_science_project
cd my_data_science_project

# Create the subdirectories
mkdir -p data notebooks scripts thepkg/interface thepkg/ml_logic

# Create the top-level files
touch Dockerfile Makefile .env .envrc .gitignore README.md requirements.txt setup.py

# Create the data, notebook, and script placeholders
touch data/data.csv
touch notebooks/datascientist_deliverable.ipynb
touch scripts/script.py

# Create the package modules
touch thepkg/__init__.py thepkg/params.py thepkg/utils.py
touch thepkg/interface/__init__.py
touch thepkg/ml_logic/__init__.py thepkg/ml_logic/data.py \
      thepkg/ml_logic/model.py thepkg/ml_logic/preprocessor.py
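
To use it, save the script to a file (the name create_boilerplate.sh below is just an example) and run it from the directory where you want the project to live:

chmod +x create_boilerplate.sh
./create_boilerplate.sh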